

Search for: All records

Creators/Authors contains: "Guestrin, Carlos"


  1. Importance: Continuous glucose monitoring (CGM) is associated with improvements in hemoglobin A1c (HbA1c) in youths with type 1 diabetes (T1D); however, youths from minoritized racial and ethnic groups and those with public insurance face greater barriers to CGM access. Early initiation of and access to CGM may reduce disparities in CGM uptake and improve diabetes outcomes.
     Objective: To determine whether HbA1c decreases differed by ethnicity and insurance status among a cohort of youths newly diagnosed with T1D and provided CGM.
     Design, Setting, and Participants: This cohort study used data from the Teamwork, Targets, Technology, and Tight Control (4T) study, a clinical research program that aims to initiate CGM within 1 month of T1D diagnosis. All youths with new-onset T1D diagnosed between July 25, 2018, and June 15, 2020, at Stanford Children’s Hospital, a single-site, freestanding children’s hospital in California, were approached to enroll in the Pilot-4T study and were followed for 12 months. Data analysis was performed and completed on June 3, 2022.
     Exposures: All eligible participants were offered CGM within 1 month of diabetes diagnosis.
     Main Outcomes and Measures: To assess HbA1c change over the study period, analyses were stratified by ethnicity (Hispanic vs non-Hispanic) or insurance status (public vs private) to compare the Pilot-4T cohort with a historical cohort of 272 youths diagnosed with T1D between June 1, 2014, and December 28, 2016.
     Results: The Pilot-4T cohort comprised 135 youths, with a median age of 9.7 years (IQR, 6.8-12.7 years) at diagnosis. There were 71 boys (52.6%) and 64 girls (47.4%). Based on self-report, participants’ race was categorized as Asian or Pacific Islander (19 [14.1%]), White (62 [45.9%]), or other race (39 [28.9%]); race was missing or not reported for 15 participants (11.1%). Participants also self-reported their ethnicity as Hispanic (29 [21.5%]) or non-Hispanic (92 [68.1%]). A total of 104 participants (77.0%) had private insurance and 31 (23.0%) had public insurance. Compared with the historical cohort, similar reductions in HbA1c at 6, 9, and 12 months postdiagnosis were observed for Hispanic individuals (estimated difference, −0.26% [95% CI, −1.05% to 0.43%], −0.60% [−1.46% to 0.21%], and −0.15% [−1.48% to 0.80%]) and non-Hispanic individuals (estimated difference, −0.27% [95% CI, −0.62% to 0.10%], −0.50% [−0.81% to −0.11%], and −0.47% [−0.91% to 0.06%]) in the Pilot-4T cohort. Similar reductions in HbA1c at 6, 9, and 12 months postdiagnosis were also observed for publicly insured individuals (estimated difference, −0.52% [95% CI, −1.22% to 0.15%], −0.38% [−1.26% to 0.33%], and −0.57% [−2.08% to 0.74%]) and privately insured individuals (estimated difference, −0.34% [95% CI, −0.67% to 0.03%], −0.57% [−0.85% to −0.26%], and −0.43% [−0.85% to 0.01%]) in the Pilot-4T cohort. Hispanic youths in the Pilot-4T cohort had higher HbA1c at 6, 9, and 12 months postdiagnosis than non-Hispanic youths (estimated difference, 0.28% [95% CI, −0.46% to 0.86%], 0.63% [0.02% to 1.20%], and 1.39% [0.37% to 1.96%]), as did publicly insured youths compared with privately insured youths (estimated difference, 0.39% [95% CI, −0.23% to 0.99%], 0.95% [0.28% to 1.45%], and 1.16% [−0.09% to 2.13%]).
     Conclusions and Relevance: The findings of this cohort study suggest that CGM initiation soon after diagnosis is associated with similar improvements in HbA1c for Hispanic and non-Hispanic youths as well as for publicly and privately insured youths. These results further suggest that equitable access to CGM soon after T1D diagnosis may be a first step to improve HbA1c for all youths but is unlikely to eliminate disparities entirely.
     Trial Registration: ClinicalTrials.gov Identifier: NCT04336969
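The outcomes analysis above compares HbA1c change between the Pilot-4T and historical cohorts within ethnicity or insurance strata at fixed timepoints. The sketch below is only a minimal, hypothetical illustration of that kind of stratified comparison; the column names (`cohort`, `insurance`, `month`, `hba1c`) and the bootstrap confidence interval are assumptions for illustration, not the study's actual statistical model.

```python
# Illustrative sketch only: estimates the cohort difference in mean HbA1c at a
# fixed timepoint within an insurance stratum. Column names are hypothetical;
# the published analysis may use different, adjusted models.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def cohort_difference(df: pd.DataFrame, stratum: str, month: int, n_boot: int = 2000):
    """Difference in mean HbA1c (Pilot-4T minus historical) with a bootstrap 95% CI."""
    sub = df[(df["insurance"] == stratum) & (df["month"] == month)]
    pilot = sub.loc[sub["cohort"] == "pilot_4t", "hba1c"].to_numpy()
    hist = sub.loc[sub["cohort"] == "historical", "hba1c"].to_numpy()
    point = pilot.mean() - hist.mean()
    boots = [
        rng.choice(pilot, pilot.size).mean() - rng.choice(hist, hist.size).mean()
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)
```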
  2. Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it. 
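As a rough illustration of the CheckList idea (templates crossed with linguistic capabilities, expanded into many test cases and run against a model to surface failures), here is a minimal, library-free Python sketch. The `expand` helper, the toy sentiment model, and the template are invented for illustration and are not the CheckList package's API.

```python
# Minimal, library-free sketch of the CheckList idea: expand a template into
# test cases for a named capability, run a model, and report the failure rate.
from itertools import product

def expand(template: str, **slots) -> list[str]:
    """Fill every combination of slot values into the template."""
    keys = list(slots)
    return [template.format(**dict(zip(keys, combo)))
            for combo in product(*(slots[k] for k in keys))]

def toy_sentiment_model(text: str) -> int:
    """Stand-in classifier: 1 = positive, 0 = negative."""
    return 1 if "love" in text or "great" in text else 0

# A Minimum Functionality Test (MFT): clearly positive sentences must be labeled 1.
cases = expand("I {verb} this {thing}.", verb=["love", "enjoy"], thing=["movie", "flight", "book"])
failures = [c for c in cases if toy_sentiment_model(c) != 1]
print(f"MFT 'simple positives' failure rate: {len(failures)}/{len(cases)}")
```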
  3. Although current evaluation of question-answering systems treats predictions in isolation, we need to consider the relationship between predictions to measure true understanding. A model should be penalized for answering “no” to “Is the rose red?” if it answers “red” to “What color is the rose?”. We propose a method to automatically extract such implications for instances from two QA datasets, VQA and SQuAD, which we then use to evaluate the consistency of models. Human evaluation shows these generated implications are well formed and valid. Consistency evaluation provides crucial insights into gaps in existing models, while retraining with implication-augmented data improves consistency on both synthetic and human-generated implications. 
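To make the consistency idea concrete, the following is a small hypothetical sketch: it derives a yes/no implication from a "What color is the X?" question the model has already answered, then checks whether the model agrees with its own implication. The single regex rule and the toy model are illustrative stand-ins, not the paper's broader implication-extraction method for VQA and SQuAD.

```python
# Hedged sketch of consistency evaluation via automatically derived implications.
import re

def implication_from_color_question(question: str, predicted_answer: str):
    """'What color is the X?' answered 'red' implies ('Is the X red?', 'yes')."""
    m = re.match(r"What color is the (.+)\?", question)
    if m is None:
        return None
    return f"Is the {m.group(1)} {predicted_answer}?", "yes"

def is_consistent(qa_model, question: str) -> bool:
    answer = qa_model(question)
    implied = implication_from_color_question(question, answer)
    if implied is None:
        return True  # no implication extracted; nothing to check
    implied_question, expected = implied
    return qa_model(implied_question).lower() == expected

# Toy usage with a hard-coded "model" that is inconsistent on purpose.
toy_model = {"What color is the rose?": "red", "Is the rose red?": "no"}.get
print(is_consistent(toy_model, "What color is the rose?"))  # False
```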
  4. Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) – semantic-preserving perturbations that induce changes in the model’s predictions. We generalize these adversaries into semantically equivalent adversarial rules (SEARs) – simple, universal replacement rules that induce adversaries on many instances. We demonstrate the usefulness and flexibility of SEAs and SEARs by detecting bugs in black-box state-of-the-art models for three domains: machine comprehension, visual question-answering, and sentiment analysis. Via user studies, we demonstrate that we generate high-quality local adversaries for more instances than humans, and that SEARs induce four times as many mistakes as the bugs discovered by human experts. SEARs are also actionable: retraining models using data augmentation significantly reduces bugs, while maintaining accuracy. 
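Evaluating a SEAR can be sketched as applying a semantics-preserving replacement rule across a dataset and measuring how often predictions flip. The sketch below assumes a "What is" → "What's" rewrite as the example rule and uses a toy predictor; it is not the paper's rule-mining procedure.

```python
# Illustrative sketch: apply a SEAR-style replacement rule and count prediction flips.
def apply_rule(text: str, old: str, new: str) -> str:
    return text.replace(old, new)

def flip_rate(predict, instances, rule=("What is", "What's")) -> float:
    """Fraction of rule-applicable instances whose prediction changes under the rule."""
    flips = 0
    applicable = 0
    for text in instances:
        rewritten = apply_rule(text, *rule)
        if rewritten == text:
            continue  # rule did not apply to this instance
        applicable += 1
        if predict(rewritten) != predict(text):
            flips += 1
    return flips / applicable if applicable else 0.0

# Toy usage: a brittle "model" keyed on the exact surface form of the question.
toy_predict = lambda q: "Paris" if q == "What is the capital of France?" else "unknown"
questions = ["What is the capital of France?", "Who wrote Hamlet?"]
print(flip_rate(toy_predict, questions))  # 1.0 -> the rule induces a flip
```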